Introduction

### MDS plot to see relations between counts of different cell lines in the experiment
knitr::include_graphics("data/MDS_A1.png")
\label{fig:MDS}Figure 1. Multidimensional Scaling (MDS) plot with age group coloring, with the clustering of samples based on gene expression data.

Figure 1. Multidimensional Scaling (MDS) plot with age group coloring, with the clustering of samples based on gene expression data.

### Gene representation for each cell line
knitr::include_graphics("data/GeneRep_A1.png")
\label{fig:MDS}Figure 1. Multidimensional Scaling (MDS) plot with age group coloring, with the clustering of samples based on gene expression data.

Figure 1. Multidimensional Scaling (MDS) plot with age group coloring, with the clustering of samples based on gene expression data.

Fetch data

  • Load the normalized counts data aggregated from Assignment 1
  • Create a sampling matrix to run our experiment as explained in the introduction
  • Creating a list of samples
  • This study focuses on how different cell lines are affected by the corresponding genes and therefore our sample will simply be a mix of the 4 cell_lines focused in this study - “HCC1806”, “HCC1954”, “JIMT1”, “MDAMB231”
samples <- data.frame(ID = 1:40, cell_line = c("HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231"))
head(samples)

Differential Gene Expression Analysis

knitr::include_graphics("data/MDS_A1.png")

* Since the data has been normalized, there have not been any outliers from the clusters and the cell lines are well represented in the data. * We can now use the cell_line attribute to calculate the differential data.

Model Design

  • The design matrix fro this experiment will include only cell_lines
  • Without looss of generality, we can use the HCC1806 cell line as our control group (0) and the rest will be labelled with numbered indicators.
  • With the sample matrix, we can then create the DEGList object using the edgeR package.
samples$cell_line <- relevel(as.factor(samples$cell_line), ref = "HCC1806")
design_matrix <- model.matrix(~  samples$cell_line)
head(design_matrix)
##   (Intercept) samples$cell_lineHCC1954 samples$cell_lineJIMT1
## 1           1                        0                      0
## 2           1                        1                      0
## 3           1                        0                      1
## 4           1                        0                      0
## 5           1                        0                      0
## 6           1                        1                      0
##   samples$cell_lineMDAMB231
## 1                         0
## 2                         0
## 3                         0
## 4                         1
## 5                         0
## 6                         0
counts <- normalized_count_data[,2: ncol(normalized_count_data)-1]
counts
dge <- DGEList(counts = counts, group = unique(samples$cell_line))
dge <- calcNormFactors(dge)

Mean-variance Plot

  • We can create a dispersion using the DGE object to use in a mean variance plot and check variance against the control
disp <- estimateDisp(dge, design_matrix)
## Warning in estimateDisp.default(y = y$counts, design = design, group = group, :
## No residual df: setting dispersion to NA
plotMeanVar(disp, 
            show.raw.vars = TRUE, 
            NBline=TRUE,
            ylim=c(0, 60),
            main = "Mean-Variance Plot showing variance of cell lines from control")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 2921 x values <= 0 omitted from
## logarithmic plot
## Warning in plot.window(...): nonfinite axis=2 limits [GScale(-inf,1.77815,..);
## log=TRUE] -- corrected now

Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?

# Calculating P-valiues
# get expression mtrx
expression <- as.matrix(log2(counts))
rownames(expression) <- normalized_count_data$gene
colnames(expression) <- colnames(counts)
# get minimal set and fit using lm
minimal_set <- ExpressionSet(assayData=expression)
fit <- lmFit(minimal_set, design_matrix[1:4])

The threshold used was 0.05 as it gives us a good confidence interval for the p-values and is generally used as a rule of thumb.

Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?

# Using Bergamini method to adjust values and fit bayesian trend
updated_fit <- eBayes(fit, trend=TRUE)
hits <- topTable(updated_fit,
                     coef=ncol(design_matrix[1:4]),
                     adjust.method = "BH", # using bergamini
                     number = nrow(expression))
outputs <- merge(normalized_count_data[1],
                     hits,
                     by.y=0,by.x=1,
                     all.y=TRUE) 
# ranking the values on p-value
outputs <- outputs[order(outputs$P.Value),]
head(outputs)

Pvalues below threshold before adjusting

(p <- length(which(outputs$P.Value < 0.05)))
## [1] 16789

Pvalues below threshold after adjusting

(adj_p <- length(which(outputs$adj.P.Val < 0.05)))
## [1] 16485

The bergamini method was used for hypothesis correction. It is clear that the adjuted outputs are lower than the values before adjusting and not as many passed the correction.

Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.

In this section we can use a volcano plot to highlight the differencially expressed genes and show genes of interest.

volcano <- data.frame(
    "gene" = outputs$HCC1806,
    "logFC" = outputs$logFC,
    "adj.P.Val" = outputs$adj.P.Val
)
EnhancedVolcano(volcano,
  lab = volcano$gene,
  x = 'logFC',
  y = 'adj.P.Val',
  pCutoff = 0.05,
  title = "Volcano Plot with threshold cutoff of 0.05",
  ylim = c(0,10)
)

Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.

top_hits <- outputs$HCC1806[outputs$adj.P.Val < 0.05 && abs(outputs$logFC) > 1]
## Warning in outputs$adj.P.Val < 0.05 && abs(outputs$logFC) > 1: 'length(x) =
## 23686 > 1' in coercion to 'logical(1)'

## Warning in outputs$adj.P.Val < 0.05 && abs(outputs$logFC) > 1: 'length(x) =
## 23686 > 1' in coercion to 'logical(1)'
heatmap <- normalized_count_data[which(normalized_count_data$gene %in% top_hits),][1:4]
heatmap <- na.omit(t(scale(t(heatmap)))[1:50,])
heatmap <- Heatmap(as.matrix(heatmap),
      show_row_names = FALSE)
heatmap

Thresholded over-representation analysis

Which method did you choose?

We will choose to run an enrichment analysis to compare for over representation of genes *Find the top up regulated genes

top_up <- outputs$HCC1806[outputs$logFC > 1 && outputs$adj.P.Val < 0.05]
## Warning in outputs$logFC > 1 && outputs$adj.P.Val < 0.05: 'length(x) = 23686 >
## 1' in coercion to 'logical(1)'

## Warning in outputs$logFC > 1 && outputs$adj.P.Val < 0.05: 'length(x) = 23686 >
## 1' in coercion to 'logical(1)'
query_res <- gost(query = top_up, organism = "hsapiens")
query_res$result$query <- "up regulated"
gostplot(query_res)

What annotation data did you use and why? What version of the annotation are you using?

We use GO dataset for annotation as it is the one that has the most hits with our outputs and is one of the most commonly used datasets.

res <- query_res$result[query_res$result$source == "GO:BP", ]
## Use a dot plot to show visualization of enrichment analysis
ggplot(res[1:20,]) + 
  geom_point(aes(
    x = precision, 
    color = p_value,
    y = term_name,
    size = intersection_size)) +
  labs(title = "Enrichment analysis on up regulated genes from dataset")

### Down-regulated genes

top_down <- outputs$HCC1806[outputs$logFC < -1]
query_res <- gost(query = top_down,
                organism = "hsapiens")
gostplot(query_res)

Again, we use GO:BP to compare with it being most representative and plot the dot plot using ggplot

res_down <- query_res$result[query_res$result$source == "GO:BP", ]
ggplot(res_down[1:20,]) + 
  geom_point(aes(
    x = precision, 
    color = p_value,
    y = term_name,
    size = intersection_size)) +
  labs(title = "Enrichment analysis on down regulated genes from dataset")

Interpretation

Do the over-representation results support conclusions or mechanism discussed in the original paper?

The over-representation results seen here do support the conclusions in the original paper that the knocking out metabolic genes from cell lines can affect different cell lines in different ways. There were similarities between the control cell line used in the experiment and the HCC1954 cell line while the other two were farther away on the MDS plots. As seen in our analysis the different cell lines are differencially expressed with some being clustered together.

Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

I have not been able to find any similar experiments to support the results seen here.

References